Problem 1

library(tidyverse)

Consider the iris dataset. It is one of the built-in datasets in your R installation.

The data set was created by statistician Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems”.

This famous iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

The built-in data set iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
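The description above can be checked directly. A minimal sketch, using only base R functions on the built-in data set:

```r
# Sanity-check the structure described above (base R only).
data(iris)

dim(iris)             # number of rows and columns
names(iris)           # variable names
table(iris$Species)   # number of flowers per species
sapply(iris, class)   # storage class of each column
```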

head(iris) # quick look at the first six rows
glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4,...
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2,...
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa...
  a. (1 point) Create a variable called myid that contains the result of multiplying \(\pi\) by the 3-4 digit number in your university email, and dividing by 1000.
myid <- (pi * 0230) / 1000 # multiply pi by the ID digits (R reads 0230 as 230), then divide by 1000
myid # print the value
## [1] 0.7225663
  b. (2 points) Use a histogram to explore the distribution of one of the numerical variables in the dataset.
# use ggplot to create the histogram
ggplot(data = iris, mapping = aes(x = Petal.Length)) + # petal length (cm) on the x axis
  geom_histogram(binwidth = 0.25) # count the flowers in each 0.25 cm bin; the bin width is a judgment call
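A histogram is a count of observations per bin. As a rough sketch of what the plot computes, the same kind of counts can be tabulated with base R. The 0.5 cm bin edges below are an arbitrary choice for illustration, not something the assignment specifies.

```r
# Bin edges from 1 cm to 7 cm in 0.5 cm steps (an assumed, illustrative choice).
bin_edges <- seq(1, 7, by = 0.5)

# cut() assigns each petal length to a bin; table() counts flowers per bin.
bin_counts <- table(cut(iris$Petal.Length, breaks = bin_edges, include.lowest = TRUE))
bin_counts
```

The empty bin just above 2 cm is the gap between setosa and the other two species, which is why the histogram looks like two separate clusters.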

  c. (2 points) Create side-by-side boxplots that show the distribution of the numerical variable you chose in part (b) for the 3 different types of Species. Make sure to comment on your results.
# use ggplot to create the boxplots
ggplot(data = iris, aes(x = Species, y = Petal.Length, fill = Species)) + # petal length by species, one fill color per species
  geom_boxplot() # one box per species

This box plot shows that setosa has by far the shortest petals and the smallest spread: almost all of its petal lengths fall between about 1 and 1.75 cm. Versicolor has longer petals and a wider spread, with the box (the middle half of the flowers) running from about 4 to 4.75 cm. Virginica has the longest petals and the widest spread, with its box running from about 5 to 6 cm. Overall, virginica has the largest petal lengths and spread, and setosa has the smallest of both.
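Values read off a box plot are approximate. One way to back up the comments with numbers is a five-number summary per species. This is a sketch in base R; `petal_summary` and the column labels are names chosen here for illustration.

```r
# Five-number summary of Petal.Length (cm) for each species.
# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum.
petal_summary <- t(sapply(split(iris$Petal.Length, iris$Species), fivenum))
colnames(petal_summary) <- c("min", "lower_hinge", "median", "upper_hinge", "max")

petal_summary
```

The hinges correspond closely to the edges of each box, and the minimum and maximum give the full range, including any points drawn as outliers.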

  d. (2 points) Create a summary with the average Petal.Length and Petal.Width per Species.
iris %>% # data set
  group_by(Species) %>% # one group per species
  summarise(Avg_Petal_length = mean(Petal.Length), Avg_Petal_width = mean(Petal.Width)) # mean petal length and width within each species
## `summarise()` ungrouping output (override with `.groups` argument)
  e. (2 points) Create a new data frame called my_iris in which all numerical attributes have been transformed from centimeters to millimeters.
my_iris <- iris # start from a copy of the original data set
my_iris[1:4] <- my_iris[1:4] * 10 # columns 1 to 4 hold the numeric measurements; multiplying by 10 converts cm to mm. Species (column 5) is left untouched
my_iris # print the converted data set
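Selecting columns by position works here because the four measurements happen to come first. As an alternative sketch, the numeric columns can be selected by type with `across()` and `where()`, which assumes dplyr 1.0.0 or later. The names `my_iris_alt` and `my_iris_pos` are hypothetical and used only for this comparison.

```r
library(dplyr)

# Convert every numeric column from cm to mm, selecting columns by type.
my_iris_alt <- iris %>%
  mutate(across(where(is.numeric), ~ .x * 10))

# Positional version, for comparison.
my_iris_pos <- iris
my_iris_pos[1:4] <- my_iris_pos[1:4] * 10

# Both approaches should give the same data frame.
all.equal(my_iris_alt, my_iris_pos)
```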
  f. (2 points) Use the variable myid from part (a) to obtain all the observations from your my_iris data frame for which the sepal length (in millimeters) is greater than or equal to myid.
my_iris %>% # the data set we just made
  filter(Sepal.Length >= myid) # keep rows with sepal length (mm) >= myid; myid is about 0.72 and the shortest sepal is 43 mm, so all 150 rows are returned
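Because myid is below 1 and sepal lengths in millimeters are in the tens, the filter should keep every row. A quick base R check, which rebuilds `myid` and `my_iris` as defined above so that the block stands alone:

```r
# Rebuild the objects from the earlier parts.
myid <- (pi * 230) / 1000
my_iris <- iris
my_iris[1:4] <- my_iris[1:4] * 10

# Smallest sepal length in mm, compared with the threshold.
min(my_iris$Sepal.Length)
myid

# Number of rows that pass the filter, out of the total number of rows.
rows_kept <- sum(my_iris$Sepal.Length >= myid)
c(kept = rows_kept, total = nrow(my_iris))
```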

Problem 2

The data for this problem contains songs from the Billboard Hot 100 list from 1960 to 2015.

music_path <- "https://raw.githubusercontent.com/reisanar/datasets/master/bbTop100.csv"
music_top100 <- read.csv(music_path)

Some of the variables included in this dataset are the year the song came out, the artist_name, duration (in milliseconds), among others.

Variable name: Description
year: year
artist_name: the artist of the song
explicit: if the track is rated as explicit
track_name: the name of the track
danceability: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: a measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key: the key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C#/Db, 2 = D, and so on.
loudness: the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode: indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness: detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness: a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness: predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness: detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo: the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms: the duration of the track in milliseconds.
glimpse(music_top100) #wanting to take a look at the names of the columns
## Rows: 5,497
## Columns: 16
## $ year             <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 19...
## $ artist_name      <chr> "Percy Faith & His Orchestra", "Jim Reeves", "John...
## $ explicit         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ track_name       <chr> "The Theme From \"A Summer Place\" - Single Versio...
## $ danceability     <dbl> 0.466, 0.554, 0.758, 0.583, 0.567, 0.635, 0.534, 0...
## $ energy           <dbl> 0.389, 0.186, 0.462, 0.168, 0.141, 0.391, 0.720, 0...
## $ key              <int> 5, 1, 5, 0, 10, 4, 10, 7, 4, 1, 7, 11, 4, 2, 7, 6,...
## $ loudness         <dbl> -12.825, -15.846, -8.952, -12.426, -14.803, -14.14...
## $ mode             <int> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,...
## $ speechiness      <dbl> 0.0253, 0.0379, 0.0482, 0.0350, 0.0315, 0.0367, 0....
## $ acousticness     <dbl> 0.6310, 0.9090, 0.7900, 0.8690, 0.9250, 0.5110, 0....
## $ instrumentalness <dbl> 8.43e-01, 1.44e-03, 1.25e-05, 0.00e+00, 4.65e-03, ...
## $ liveness         <dbl> 0.2950, 0.1100, 0.1700, 0.1480, 0.1170, 0.4930, 0....
## $ valence          <dbl> 0.745, 0.200, 0.726, 0.353, 0.315, 0.713, 0.901, 0...
## $ tempo            <dbl> 92.631, 81.181, 120.004, 97.572, 103.078, 126.267,...
## $ duration_ms      <int> 144893, 138640, 160027, 157080, 160067, 195733, 12...
  a. (1 point) Find all the songs from 2013
# use filter()
music_top100 %>% # data set
  filter(year == 2013) # year is stored as an integer, so compare against a number rather than a string
  b. (2 points) Make a new variable duration_min to convert from milliseconds to minutes
# hint: use the mutate() function
music_top100 <- music_top100 %>% # data set
  mutate(duration_min = duration_ms / 60000) # there are 60,000 ms in one minute (1,000 ms per second x 60 seconds)

music_top100
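As a worked check of the conversion factor, here is a small sketch using the first three duration_ms values visible in the glimpse() output above. The name `ms_per_minute` is introduced here only for illustration.

```r
# First three duration_ms values shown in the glimpse() output.
duration_ms <- c(144893, 138640, 160027)

# 1000 milliseconds per second, 60 seconds per minute.
ms_per_minute <- 1000 * 60

duration_min <- duration_ms / ms_per_minute
round(duration_min, 2)
```

All three tracks come out between two and three minutes, which is a plausible length for songs from 1960.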
  c. (2 points) Check the distribution of duration of songs. Use a histogram for the variable duration_min with binwidth=0.5
# hint: use geom_histogram
ggplot(data = music_top100, mapping = aes(x = duration_min)) + # song duration (minutes) on the x axis
  geom_histogram(binwidth = 0.5) # count the songs in each half-minute bin

# same idea as the histogram in Problem 1, with the bin width set to 0.5
  d. (2 points) Are there songs longer than 10 minutes? If so, what are the years when such songs were part of the Top 100 list?
# use the filter() function
music_top100 %>% # data set
  filter(duration_min > 10) %>% # keep only the songs longer than 10 minutes
  select(year) # keep only the year column for the songs that passed the filter
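If several long songs share a year, selecting the year column repeats that year once per song. A possible refinement, sketched here on made-up toy data rather than the Billboard file, is to reduce the result to unique years with `distinct()`. The tibble `toy_durations` and its values are hypothetical.

```r
library(dplyr)

# Toy data, invented for illustration only.
toy_durations <- tibble(
  year         = c(1968, 1968, 1972, 1980),
  duration_min = c(10.5, 11.2, 3.1, 12.0)
)

# Keep songs over 10 minutes, then list each qualifying year once.
long_song_years <- toy_durations %>%
  filter(duration_min > 10) %>%
  distinct(year) %>%
  arrange(year)

long_song_years
```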
  e. (2 points) How would you find the exact number of “explicit” songs that were part of the “Top 100 List” per year?
# number of explicit songs per year
music_top100 %>% # data set
  filter(explicit) %>% # explicit is a logical column, so this keeps the TRUE rows
  group_by(year) %>% # count within each year rather than over the whole data set
  summarise(n_explicit = n()) # number of explicit songs in each year
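A per-year answer needs one count for each year rather than a single total. The sketch below shows the idea on made-up toy data, not the Billboard file, using `count()` as a shorthand for grouping and then counting. The tibble `toy_songs` and the column name `n_explicit` are hypothetical.

```r
library(dplyr)

# Toy data, invented for illustration only.
toy_songs <- tibble(
  year     = c(2010, 2010, 2011, 2011, 2011, 2012),
  explicit = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Keep the explicit tracks, then count rows within each year.
explicit_per_year <- toy_songs %>%
  filter(explicit) %>%
  count(year, name = "n_explicit")

explicit_per_year
```

Years with no explicit songs (2012 in the toy data) are dropped by the filter, so they do not appear in the result with a count of zero.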

Tell us something we don’t know (Extra credit: up to 3 points)

  • This is your chance to be creative and explore the data set in a way we have not done yet.

  • Think of a question you may be able to answer with the number of observations and variables included in the dataset. Make sure to write down the question and list the steps you might need to execute in order to provide an answer to the question.

As a sound technician, I often hear that music “feels better” when it’s louder. Personally, I have also always pushed the decibels when a song had a higher tempo, because it felt more energetic and “fun”. My assumption is that, due to cultural trends, both loudness and tempo have increased over the years. As a first step, I want to look at how the loudness and tempo of the Top 100 songs have changed over time.

new_loudness <- music_top100 %>% # start from the full data set
  group_by(year) %>% # one group per year
  summarise(loudness = mean(loudness)) # average loudness within each year
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(new_loudness, aes(x = year, y = loudness)) + # average loudness against year
  geom_point(size = 3) + # larger points so they are easier to see
  geom_smooth(method = "lm") # straight-line fit to show the overall trend
## `geom_smooth()` using formula 'y ~ x'

This graph indicates that the average loudness of the Top 100 songs has increased over the years. My assumption is already “half correct”, as dB levels have risen over time.

new_tempo <- music_top100 %>% # start from the full data set
  group_by(year) %>% # one group per year
  summarise(tempo = mean(tempo)) # average tempo within each year
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(new_tempo, aes(x = year, y = tempo)) + # average tempo against year
  geom_point(size = 3) + # larger points so they are easier to see
  geom_smooth(method = "lm") # straight-line fit to show the overall trend
## `geom_smooth()` using formula 'y ~ x'

This graph shows no clear trend in the average tempo of the Top 100 songs over the years. As a result, my assumption that loudness and tempo rose together is only partly supported. It looks like louder mixes, rather than faster tempos, are what changed among Top 100 songs, although these plots alone cannot show that loudness is what made a song chart.
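Reading a trend off a fitted line is a visual judgment. One way to put a number on it is the slope of a straight-line fit, in units per year. The sketch below uses simulated yearly averages, not the Billboard data, purely to show the mechanics. `trend_slope` is a hypothetical helper and the simulated values are arbitrary.

```r
# Hypothetical helper: slope of a straight-line fit of `value` on `year`.
trend_slope <- function(year, value) {
  fit <- lm(value ~ year)
  unname(coef(fit)["year"])
}

# Simulated yearly averages (NOT the Billboard data), for illustration only.
set.seed(1)
year   <- 1960:2015
rising <- -12 + 0.1 * (year - 1960) + rnorm(length(year), sd = 0.5)
flat   <- 120 + rnorm(length(year), sd = 2)

slope_rising <- trend_slope(year, rising)  # near the simulated slope of 0.1 per year
slope_flat   <- trend_slope(year, flat)    # near zero: no built-in trend

c(rising = slope_rising, flat = slope_flat)
```

Applied to the real yearly averages, `summary(lm(...))` would also report a standard error for the slope, which is what a claim about a "significant" trend would need.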

After seeing the dB level increase year over year, I also wanted to check the energy of these songs over the years. According to the data set description, energetic tracks feel fast, loud, and noisy, so I looked at how the average energy level has changed.

new_energy <- music_top100 %>% # start from the full data set
  group_by(year) %>% # one group per year
  summarise(energy = mean(energy)) # average energy within each year
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(new_energy, aes(x = year, y = energy)) + # average energy against year
  geom_point(size = 3) + # larger points so they are easier to see
  geom_smooth(method = "lm") # straight-line fit to show the overall trend
## `geom_smooth()` using formula 'y ~ x'

This graph indicates that the average energy of the Top 100 songs has increased over the years. This makes sense, as the data set description says that energetic tracks feel fast, loud, and noisy.

We saw previously that dB levels increased over the years, so it is understandable that energy is also increasing. However, the “feeling of speed” clearly does not track the actual tempo of the songs. So I guess I should keep pushing the dB level to get an “energetic” feel regardless of tempo. :)

I also compared the yearly average loudness and tempo directly against one another.

added <- music_top100 %>% # start from the full data set
  group_by(year) %>% # one group per year
  summarise(avg_loudness = mean(loudness), avg_tempo = mean(tempo)) # average loudness and tempo within each year
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(added, aes(x = avg_loudness, y = avg_tempo)) + # yearly average tempo against yearly average loudness
  geom_point(size = 3) + # larger points so they are easier to see
  geom_smooth(method = "lm") # straight-line fit
## `geom_smooth()` using formula 'y ~ x'

The fitted line barely slopes upward, so across years the average loudness and average tempo are at most weakly related.